Decision Trees

Decision trees are tree-like models, or flowcharts, that split the data based on decisions, events, or their consequences

The image below shows a graphical representation of how data is split in a decision tree

It is also an example of a classification tree

Classification tree

DecisionTrees.png

Tree Method

The data below is well suited for a tree method, if your question is whether your friend will come to play or not

Datasuitedfortree.png

So in the decision tree we have the following elements:

* The root - the first split, at the top of our tree

* The nodes - splits on the value of a certain attribute; these are also called internal nodes

* The edges - the outcomes of a split, leading to the next node

* The leaves - the end points of the tree; these terminal nodes predict the outcome
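These elements can be sketched in a few lines of Python. This is a hypothetical "play" tree with made-up attributes (`outlook`, `humidity`, `windy`), using a plain nested-dict representation rather than any particular library:

```python
# A minimal sketch of a decision tree: internal nodes hold an attribute,
# edges are the dict keys under "branches", and leaves are the predicted
# outcomes (the attribute names and values here are hypothetical).
tree = {
    "attribute": "outlook",          # the root: the first split
    "branches": {                    # edges: one per attribute value
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",           # a leaf (terminal node)
        "rainy": {"attribute": "windy",
                  "branches": {"true": "no", "false": "yes"}},
    },
}

def predict(node, sample):
    """Walk from the root along the edges until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["attribute"]]]
    return node

print(predict(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

Following the edges from the root to a leaf is all a prediction is: each sample takes exactly one path down the tree.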

Regression Tree

The below is an example of a regression tree

RegressionTree.png

How to grow a tree in the best manner?

Now let's consider the following example dataset 'D'

ExampleDatasetD.png

From the dataset above we can see that the feature Y splits the classes clearly, hence it should be the root of our tree. Since this is a small dataset we can easily identify the splits by eye, but for a large dataset we use recursive binary splitting, the greedy approach, to find the best split among our features

Splits using Greedy approach (Regression Tree)

The image below gives us an idea of how the greedy approach finds the best splits

GreedyApproach.png

Let's now see the steps to find the best split for our dataset and grow our regression tree

* Let the cube represent the feature space of our dataset; in our example D we take the features X, Y, Z and plot the points

* Now we have to find our first split, the root of our tree (the violet equation gives us the math). We split the data along a particular value such that the sum of the RSS (residual sum of squares) of the two resulting regions is minimized. In our example D a split along the feature Y gives us an RSS of zero while X and Z give an RSS greater than zero, hence the greedy approach takes the split along Y as the root of our tree

* After making our first split we get two regions, say R1 and R2; we repeat the same step within R1 and R2 and continue to grow our tree

* In the regression case we can stop when a leaf region contains only a few points (say one to five); the exact stopping criterion is up to you

* The predicted value for a region would then be the mean of the responses in that region, (or)

* Alternatively, we can predict the value of y within each region using a linear regression model, choosing between the approaches by cross-validation
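The split-selection step above can be sketched in plain Python. This is a minimal one-feature version of recursive binary splitting (the data and threshold search are hypothetical, not the exact example D):

```python
def rss(values):
    """Residual sum of squares of a region around its mean prediction."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Greedy search: try every threshold on one feature and keep the
    split whose two regions have the lowest total RSS."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (t, total)
    return best

# Hypothetical 1-D data: y jumps between x = 3 and x = 4,
# so the greedy search picks the threshold x <= 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [5.0, 5.2, 4.8, 9.9, 10.1, 10.0]
threshold, total_rss = best_split(xs, ys)
print(threshold, round(total_rss, 3))  # 3 0.1
```

A full tree would apply `best_split` over every feature, take the best one, and then recurse into the two resulting regions until a stopping rule (such as the leaf size above) is met.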

Splits using Greedy approach (Classification Tree img:1 )

ClassificationTree.png

Let's now see the steps to find the best split for our dataset and grow our classification tree

* The first best split is taken among the features such that the regions after the split have the lowest total error rate. In our example dataset D, a split along Y divides it into two regions, (1,1) and (0,0), so the sum of the error rates is minimal (zero in our case)

* By following this method we can grow our binary classification tree using the greedy approach

* We can stop growing the tree when each leaf contains a region of a single class

* We can then predict the class of new data by the majority vote in its leaf region
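The error-rate split and the majority-vote prediction above can be sketched as follows (the labels are a hypothetical stand-in for the classes in example D):

```python
from collections import Counter

def error_rate(labels):
    """Misclassification rate if the region predicts its majority class."""
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - majority_count / len(labels)

def majority_vote(labels):
    """Leaf prediction: the most common class in the region."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical version of example D: feature Y separates the classes
# cleanly, so splitting on Y gives two pure regions and zero total error.
y_feature = [1, 1, 0, 0]
classes   = ["A", "A", "B", "B"]
region1 = [c for f, c in zip(y_feature, classes) if f == 1]
region2 = [c for f, c in zip(y_feature, classes) if f == 0]
print(error_rate(region1) + error_rate(region2))  # 0.0
print(majority_vote(region1))  # A
```

Because both regions are pure, the total error of the split is zero, which is exactly why the greedy approach chooses Y as the root.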

Splits using Greedy approach (Classification Tree img:2 )

ClassificationTree2.png

Generalization of classification trees

We know that the decision tree is predominantly used for classification

* There are many generalization methods through which we can determine the splits for our tree; the three most common are

1) Error Rate method

2) Entropy

3) Gini Index

The formulas for the above three are given in the image below

Generalization formulas

1) here P with subscript R denotes the probability (the proportion of a class within region R)

GeneralizationsofTree.png
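The three measures can be computed directly from the class proportions of a region. A minimal sketch, assuming the standard definitions (error rate = 1 − max proportion, entropy with base-2 logs, Gini = Σ p(1 − p)):

```python
import math

def impurities(labels):
    """Error rate, entropy, and Gini index of one region, all computed
    from the class proportions p within that region."""
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    error = 1.0 - max(props)
    entropy = -sum(p * math.log2(p) for p in props if p > 0)
    gini = sum(p * (1.0 - p) for p in props)
    return error, entropy, gini

# A pure region scores 0 on all three measures; a 50/50 two-class
# region scores the maximum on each: 0.5, 1.0, and 0.5 respectively.
print(impurities(["A", "A", "A", "A"]))
print(impurities(["A", "A", "B", "B"]))
```

Entropy and Gini are generally preferred for growing the tree because they are more sensitive to changes in node purity than the raw error rate.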

Bootstrap Aggregation ( Bagging )

Why do we need Bootstrap Aggregation ?

Definition from Wiki :

Bootstrap aggregating, also called bagging (from bootstrap aggregating), is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

How does it achieve this ?

Given a standard training set 'D' of size 'n', bagging generates 'm' new training sets, each of size n′, by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each new training set.

If n′=n, then for large n the set is expected to have the fraction (1 - 1/e) (≈63.2%) of the unique examples of D, the rest being duplicates.

This kind of sample is known as a bootstrap sample. Then, m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
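The ≈63.2% unique fraction can be checked empirically. A quick sketch using the standard library (the sample size and seed here are arbitrary):

```python
import random

def bootstrap_sample(data, rng):
    """Sample len(data) points from data uniformly WITH replacement."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)
n = 10_000
data = list(range(n))

# For large n the fraction of unique examples in a bootstrap sample
# approaches 1 - 1/e, i.e. roughly 0.632.
samples = [bootstrap_sample(data, rng) for _ in range(5)]
fractions = [len(set(s)) / n for s in samples]
print([round(f, 3) for f in fractions])
```

The left-out ~36.8% of points (the "out-of-bag" observations) are what bagging implementations commonly reuse to estimate test error without a separate validation set.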

We can find from the formulas below the relation with variance

here 'E' is the expected value

Y is the predicted value for x from a single bootstrap model

y is the true value for x

Z is the average used to aggregate the predicted values

BootstrapAggregation.png

From the last equation of the model below we can see that the error for the average, E[(Z − y)²], depends on the number of bootstrap models m; so as m tends to infinity, this error term tends to zero

BootstrapAggregation2.png
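The variance-reduction effect can be checked with a quick simulation. This is a hypothetical setup, assuming each bootstrap model's prediction Y is the true value y plus independent Gaussian noise, and Z is their average:

```python
import random
import statistics

rng = random.Random(42)
true_y = 3.0  # the true value y for a fixed input x

def noisy_prediction(rng):
    """One bootstrap model's prediction Y: true value plus noise."""
    return true_y + rng.gauss(0, 1)

def bagged_prediction(m, rng):
    """The aggregate Z: the average of m model predictions."""
    return sum(noisy_prediction(rng) for _ in range(m)) / m

# Mean squared error E[(Z - y)^2], estimated over many trials,
# shrinks roughly as 1/m as more models are averaged.
mse = {}
for m in (1, 10, 100):
    errs = [(bagged_prediction(m, rng) - true_y) ** 2 for _ in range(2000)]
    mse[m] = statistics.mean(errs)
    print(m, round(mse[m], 3))
```

With independent, unbiased models the averaged error keeps falling as m grows; in practice bootstrap models are correlated (they share training data), which is why methods like random forests add extra de-correlation on top of bagging.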
